Manish Dhakal

Khumaltar, Lalitpur,

Bagmati, Nepal

Computer Science & Machine Learning Researcher

Hello, I’m Manish Dhakal, a Ph.D. student at Georgia State University, supervised by Dr. Yi Ding. I work as a Graduate Research Assistant (GRA) in the Department of Computer Science, specializing in Computer Vision and Natural Language Processing research.

I am driven by a sincere curiosity to contribute to these fields and to investigate cutting-edge solutions. I’m pursuing graduate study to further my education and give back to the ML/AI community.

News

Sep 20, 2024	TuneVLSeg has been accepted for oral presentation at ACCV 2024.
Aug 23, 2024	Awarded the LMIC Travel Grant by MICCAI 2024 to present our research work.
Jun 18, 2024	The VLSM-Adapter paper has been accepted to the main conference of MICCAI 2024.
Apr 25, 2024	Joining the computer science Ph.D. program at Georgia State University (GSU) in Fall 2024 as a graduate research assistant.
Apr 06, 2024	Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models has been accepted for MIDL 2024.

Selected Publications

MICCAI
VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks

Manish Dhakal, Rabin Adhikari, Safal Thapaliya, and Bishesh Khanal

arXiv preprint arXiv:2405.06196, 2024


Foundation Vision-Language Models (VLMs) trained using large-scale open-domain images and text pairs have recently been adapted to develop Vision-Language Segmentation Models (VLSMs) that allow providing text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, it could aid medical professionals in many clinical tasks where they must spend substantial time delineating the target structure of interest. VLSMs for medical images resort to fine-tuning base VLM or VLSM pretrained on open-domain natural image datasets due to fewer annotated medical image datasets; this fine-tuning is resource-consuming and expensive as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed in VLMs that keep the pretrained model frozen and only train adapters during fine-tuning, substantially reducing the computing resources required. We introduce a novel adapter, VLSM-Adapter, that can fine-tune pretrained vision-language segmentation models using transformer encoders. Our experiments in widely used CLIP-based segmentation models show that with only 3 million trainable parameters, the VLSM-Adapter outperforms state-of-the-art and is comparable to the upper bound end-to-end fine-tuning.
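As a rough illustration of the adapter idea mentioned in the abstract above, the following minimal Python sketch shows a generic residual bottleneck adapter: a small trainable block added on top of frozen features. It is not the VLSM-Adapter architecture itself. The class name `BottleneckAdapter`, the helper `adapter_param_count`, and the dimensions used are all hypothetical and chosen only for illustration.

```python
import random


def adapter_param_count(d_model: int, bottleneck: int) -> int:
    """Trainable parameters (weights + biases) of one bottleneck adapter."""
    down = d_model * bottleneck + bottleneck   # down-projection
    up = bottleneck * d_model + d_model        # up-projection
    return down + up


class BottleneckAdapter:
    """Generic residual adapter: x + up(relu(down(x))).

    Hypothetical illustration only; the frozen backbone is represented
    by the input vector x, and only this block's weights would be trained.
    """

    def __init__(self, d_model: int, bottleneck: int, seed: int = 0):
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(bottleneck)]
                       for _ in range(d_model)]
        self.b_down = [0.0] * bottleneck
        # Zero-initialised up-projection: the adapter starts as an identity
        # map, so the frozen model's behaviour is unchanged before training.
        self.w_up = [[0.0] * d_model for _ in range(bottleneck)]
        self.b_up = [0.0] * d_model

    def forward(self, x):
        # Down-project to the bottleneck and apply ReLU.
        hidden = [
            max(0.0, sum(x[i] * self.w_down[i][j] for i in range(len(x)))
                + self.b_down[j])
            for j in range(len(self.b_down))
        ]
        # Up-project back to the model width.
        delta = [
            sum(hidden[j] * self.w_up[j][k] for j in range(len(hidden)))
            + self.b_up[k]
            for k in range(len(self.b_up))
        ]
        # Residual connection around the adapter.
        return [xi + di for xi, di in zip(x, delta)]


if __name__ == "__main__":
    adapter = BottleneckAdapter(d_model=8, bottleneck=2)
    features = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
    print(adapter.forward(features) == features)
    print(adapter_param_count(512, 64))
```

With the assumed width of 512 and bottleneck of 64, one such block has 66,112 trainable parameters, which shows why training only adapters is far cheaper than updating a full pretrained model.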
@article{dhakal2024vlsm,
  title = {VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks},
  author = {Dhakal, Manish and Adhikari, Rabin and Thapaliya, Safal and Khanal, Bishesh},
  journal = {arXiv preprint arXiv:2405.06196},
  year = {2024},
}
MIDL
Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

Kanchan Poudel, Manish Dhakal, Prasiddha Bhandari, Rabin Adhikari, Safal Thapaliya, and Bishesh Khanal

arXiv preprint arXiv:2308.07706, 2024


Medical image segmentation with deep learning is an important and widely studied topic because segmentation enables quantifying target structure size and shape that can help in disease diagnosis, prognosis, surgery planning, and understanding. Recent advances in the foundation Vision-Language Models (VLMs) and their adaptation to segmentation tasks in natural images with Vision-Language Segmentation Models (VLSMs) have opened up a unique opportunity to build potentially powerful segmentation models for medical images that enable providing helpful information via language prompt as input, leverage the extensive range of other medical imaging datasets by pooled dataset training, adapt to new classes, and be robust against out-of-distribution data with human-in-the-loop prompting during inference. Although transfer learning from natural to medical images for image-only segmentation models has been studied, no studies have analyzed how the joint representation of vision-language transfers to medical images in segmentation problems and understand gaps in leveraging their full potential. We present the first benchmark study on transfer learning of VLSMs to 2D medical images with thoughtfully collected 11 existing 2D medical image datasets of diverse modalities with carefully presented 9 types of language prompts from 14 attributes. Our results indicate that VLSMs trained in natural image-text pairs transfer reasonably to the medical domain in zero-shot settings when prompted appropriately for non-radiology photographic modalities; when finetuned, they obtain comparable performance to conventional architectures, even in X-rays and ultrasound modalities. However, the additional benefit of language prompts during finetuning may be limited, with image features playing a more dominant role; they can better handle training on pooled datasets combining diverse modalities and are potentially more robust to domain shift than the conventional segmentation models.
The code and prompts are released at https://github.com/naamiinepal/medvlsm
@article{poudel2023exploring,
  title = {Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models},
  author = {Poudel, Kanchan and Dhakal, Manish and Bhandari, Prasiddha and Adhikari, Rabin and Thapaliya, Safal and Khanal, Bishesh},
  journal = {arXiv preprint arXiv:2308.07706},
  year = {2024},
  eprint = {2308.07706},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV},
  doi = {10.48550/arXiv.2308.07706},
}